Low-cost Hardware Fault Detection and Diagnosis for Multicore Systems

Authors

  • Siva Kumar Sastry Hari
  • Man-Lap Li
  • Pradeep Ramachandran
  • Byn Choi
  • Sarita V. Adve
Abstract

Continued technology scaling is resulting in systems with billions of devices. Consequently, these devices are prone to failures from various sources, resulting in a growing reliability threat. As this reliability problem is expected to affect the broad computing market, traditional solutions involving high redundancy, or piecemeal solutions targeting specific failure modes, will no longer be viable. Recently, researchers have proposed SWAT, a low-cost solution that handles hardware faults by treating the resulting software anomalies. SWAT uses near-zero cost always-on symptom monitors for error detection and incurs higher diagnosis cost in the relatively rare event of a fault. The SWAT approach has been shown to be highly effective for single-threaded applications. It is, however, unclear whether such an approach would remain effective for the widely used multithreaded software on multicore systems. This paper presents mSWAT, SWAT for multicore architectures running multithreaded software, with a focus on detection and diagnosis. For detection, we use the symptom-based detectors in SWAT and show that symptom detection results in very low Silent Data Corruption (SDC) rates for both permanent and transient hardware faults. The challenge in this new setting is fault diagnosis, because the fault can propagate out of the faulty core and cause symptoms on fault-free cores. To this end, we propose a novel permanent fault diagnosis mechanism that identifies the faulty core even under such circumstances. Our diagnosis procedure (1) captures the trace of the in-situ execution that activates the underlying fault, (2) deterministically replays the execution on different cores, and (3) compares the execution traces to isolate the faulty core. Our results show that this diagnosis technique successfully diagnoses 95% of the detected faults while incurring minimal hardware overheads.
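The abstract gives only the outline of the three diagnosis steps (capture, replay, compare). As a rough illustration of that idea, and not the mechanism actually used in mSWAT, the toy Python model below treats each core as a function, records a per-thread trace, replays each thread on three cores, and flags the core whose trace disagrees with the majority. All names (`good_core`, `make_faulty_core`, `run`, `diagnose`) are hypothetical, and the sketch assumes a single permanent fault, at least three cores, and perfectly deterministic replay.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical model: a "core" maps one input value to one result.
Core = Callable[[int], int]


def good_core(x: int) -> int:
    """Fault-free computation performed by every healthy core."""
    return x * 2 + 1


def make_faulty_core(stuck_bit: int) -> Core:
    """Core with a permanent stuck-at-1 fault on one result bit (assumed fault model)."""
    def core(x: int) -> int:
        return good_core(x) | (1 << stuck_bit)
    return core


def run(core: Core, inputs: List[int]) -> tuple:
    """Steps 1 and 2: execute (or replay) a thread's inputs and return its trace."""
    return tuple(core(x) for x in inputs)


def diagnose(cores: List[Core], thread_inputs: List[List[int]]) -> Optional[int]:
    """Step 3: compare traces and return the index of the suspected faulty core.

    Each thread is run on its home core and replayed on the next two cores.
    A core whose trace differs from the majority trace collects a vote.
    Returns None if no trace ever diverges.
    """
    n = len(cores)
    if n < 3:
        raise ValueError("majority comparison needs at least three cores")
    suspects: Counter = Counter()
    for tid, inputs in enumerate(thread_inputs):
        home = tid % n
        candidates = [home, (home + 1) % n, (home + 2) % n]
        traces = {c: run(cores[c], inputs) for c in candidates}
        majority, _ = Counter(traces.values()).most_common(1)[0]
        for c, trace in traces.items():
            if trace != majority:
                suspects[c] += 1
    if not suspects:
        return None
    return suspects.most_common(1)[0][0]
```

For example, with four cores where core 2 is built by `make_faulty_core(3)` and the threads use inputs `[1, 2, 3]`, `[4, 5, 6]`, `[7, 8, 9]`, and `[10, 11, 12]`, `diagnose` points at core 2. Some threads never activate the fault (their correct results already have that bit set), which is why the sketch accumulates votes across threads instead of relying on a single comparison.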


Similar resources

Developing A Fault Diagnosis Approach Based On Artificial Neural Network And Self Organization Map For Occurred ADSL Faults

Telecommunication companies have received a great deal of research attention, which have many advantages such as low cost, higher qualification, simple installation and maintenance, and high reliability. However, the using of technical maintenance approaches in Telecommunication companies could improve system reliability and users' satisfaction from Asymmetric digital subscriber line (ADSL) ser...


Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs

Reversible logic is one of the new paradigms for power optimization that can be used instead of the current circuits. Moreover, the fault-tolerance capability in the form of error detection or error correction is a vital aspect for current processing systems. In this paper, as the multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...


SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters

The challenge of improving the performance of current processors is achieved by increasing the integration scale. This carries a growing vulnerability to transient faults, which increase their impact on multicore clusters running large scientific parallel applications. The requirement for enhancing the reliability of these systems, coupled with the high cost of rerunning the application from th...


FUZZY BASED FAULT DETECTION AND CONTROL FOR 6/4 SWITCHED RELUCTANCE MOTOR

Prompt detection and diagnosis of faults in industrial systems are essential to minimize the production losses, increase the safety of the operator and the equipment. Several techniques are available in the literature to achieve these objectives. This paper presents fuzzy based control and fault detection for a 6/4 switched reluctance motor. The fuzzy logic control performs like a classical proporti...


Variable Speed Wind Turbine DFIG Back to Back Converters Open-Circuit Fault Diagnosis by Using of Combiniation Signal-Based and Model-Based Methodes

Condition monitoring (CM) and Fault Detection (FD) of wind turbine lead to increase in reliability and availability of turbine. IGBT open circuit of wind turbine converter will bring about depletion in output current of converter and as a result, reduction in production of wind turbine power. In this research, back to back converter IGBT open - gate fault for wind turbine based on DFIG is detec...


Publication date: 2009